Data Pre Processing

Disciplined Workflow

Since data manipulation is easier in pandas, and there are multiple iterative steps ahead, we will stick to a discipline. Data Pre Processing and Data Standardization are the two main stages of data manipulation; after that, we only tune classifiers and evaluate. So for these first two stages, after loading the raw data_dict, we convert it to a pandas DataFrame, perform the operations belonging to those stages, and then, before Classifier Tuning, convert it back to my_dataset as required by our starter code.
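As a quick illustration of that round trip, here is a minimal sketch on a toy dict (not the project data):

import pandas as pd

# toy dict shaped like data_dict: {name -> {feature -> value}}
toy = {'PERSON A': {'salary': 100, 'poi': False},
       'PERSON B': {'salary': 'NaN', 'poi': True}}

# Dict to Df: keys become the row index
toy_df = pd.DataFrame.from_dict(toy, orient='index')

# ... pandas operations would go here ...

# Df to Dict: back to {index -> {column -> value}}
toy_dict = toy_df.to_dict('index')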

In [50]:
from graphviz import Digraph
g = Digraph('Work Flow',  node_attr={'shape': 'record'})
g.attr(rankdir='LR')

g.node('step_0', r'Dict To Df', style='filled', fillcolor='#BBDEFB')
g.node('step_1', r'1. Data\nPreProcessing',style='filled', fillcolor='#e6a26a')
g.node('step_2', r'2. Data\nStandardization', style='filled', fillcolor='#e6a26a')
g.node('step_2_5', r'Df to Dict', style='filled', fillcolor='#BBDEFB')

g.node('step_3', r'3. Classifier\nTuning', style='filled', fillcolor='#e6a26a')
g.node('step_4', r'4. Evaluation\nMetrics', style='filled', fillcolor='#e6a26a')

g.edges([('step_0','step_1'),('step_1','step_2'),('step_2','step_2_5'),('step_2_5','step_3'),('step_3','step_4')])
g
Out[50]:
[Rendered workflow diagram: Dict To Df → 1. Data PreProcessing → 2. Data Standardization → Df to Dict → 3. Classifier Tuning → 4. Evaluation Metrics]

Jumping into our current code setup, we would thus have something like the following to start with. Recall the functions from the earlier section; they are now simply pushed into the helper library for brevity. When we define new functions, we will likewise add them to the library before reusing them. Also note that we have switched to the Decision Tree classifier, as decided earlier.

In [51]:
from helpers_2 import init, split_data, tree_classify, evaluate
import pandas as pd

# raw data
data_dict = init()

# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index')  # 'index' means keys should be rows

#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
# 1.2. Remove Outliers

# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------

# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}

# CLASSIFIER
my_dataset = data_dict  # just for udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0)  # random_state=0 to get consistent results every time it is run

# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Udacity's Evaluation:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Accuracy: 0.69210   Precision: 0.23619  Recall: 0.24150 F1: 0.23881 F2: 0.24042
Total predictions: 10000
True positives:  483    False positives: 1562   False negatives: 1517   True negatives: 6438

1.1 Cleaning Invalid Values

Let us have a sneak peek into our raw data.

In [52]:
df.head()
Out[52]:
salary to_messages deferral_payments total_payments exercised_stock_options bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value ... loan_advances from_messages other from_this_person_to_poi poi director_fees deferred_income long_term_incentive email_address from_poi_to_this_person
ALLEN PHILLIP K 201955 2902 2869717 4484442 1729541 4175000 126027 1407 -126027 1729541 ... NaN 2195 152 65 False NaN -3081055 304805 phillip.allen@enron.com 47
BADUM JAMES P NaN NaN 178980 182466 257817 NaN NaN NaN NaN 257817 ... NaN NaN NaN NaN False NaN NaN NaN NaN NaN
BANNANTINE JAMES M 477 566 NaN 916197 4046157 NaN 1757552 465 -560222 5243487 ... NaN 29 864523 0 False NaN -5104 NaN james.bannantine@enron.com 39
BAXTER JOHN C 267102 NaN 1295738 5634343 6680544 1200000 3942714 NaN NaN 10623258 ... NaN NaN 2660303 NaN False NaN -1386055 1586055 NaN NaN
BAY FRANKLIN R 239671 NaN 260455 827696 NaN 400000 145796 NaN -82782 63014 ... NaN NaN 69 NaN False NaN -201641 NaN frank.bay@enron.com NaN

5 rows × 21 columns

Note that invalid values are marked with the string 'NaN'. Out of all the features, every one except email_address and poi is numerical. For these numerical features, 'NaN' does not mean "not available" but rather 0. Before trying to spot outliers, we need to clean such invalid entries.
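As a quick sanity check (a small sketch on the frame built above), the 'NaN' strings also force the numeric columns to dtype object, which is another reason to replace them:

df['salary'].dtype  # dtype('O'): the 'NaN' strings keep the column from being numeric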

In [53]:
def clean_df(raw_df):
    """
    Replace 'NaN' with 0 in all feature columns except 'poi' and 'email_address'.
    Return the updated df.
    """
    # replace 'NaN' in every column except poi and email_address (ref: https://stackoverflow.com/questions/42916989/replace-missing-values-in-all-columns-except-one-in-pandas-dataframe)
    data_panda_1 = raw_df.loc[:, raw_df.columns.difference(['poi','email_address'])].replace('NaN',0, regex=True)
    # keep the two excluded columns so we can concatenate them back
    data_panda_2 = raw_df[['poi','email_address']]
    # concat column-wise (remember axis=1)
    data_panda = pd.concat([data_panda_1, data_panda_2], axis=1, sort=False)
    # fix exponential display format caused by the replace operation (ref: https://stackoverflow.com/questions/42735541/customized-float-formatting-in-a-pandas-dataframe)
    pd.options.display.float_format = lambda x : '{:.0f}'.format(x)  # we don't bother about decimal accuracy here for calculated means, stds etc.
    return data_panda

test_df = clean_df(df)  
test_df.head()
Out[53]:
bonus deferral_payments deferred_income director_fees exercised_stock_options expenses from_messages from_poi_to_this_person from_this_person_to_poi loan_advances ... other restricted_stock restricted_stock_deferred salary shared_receipt_with_poi to_messages total_payments total_stock_value poi email_address
ALLEN PHILLIP K 4175000 2869717 -3081055 0 1729541 13868 2195 47 65 0 ... 152 126027 -126027 201955 1407 2902 4484442 1729541 False phillip.allen@enron.com
BADUM JAMES P 0 178980 0 0 257817 3486 0 0 0 0 ... 0 0 0 0 0 0 182466 257817 False NaN
BANNANTINE JAMES M 0 0 -5104 0 4046157 56301 29 39 0 0 ... 864523 1757552 -560222 477 465 566 916197 5243487 False james.bannantine@enron.com
BAXTER JOHN C 1200000 1295738 -1386055 0 6680544 11200 0 0 0 0 ... 2660303 3942714 0 267102 0 0 5634343 10623258 False NaN
BAY FRANKLIN R 400000 260455 -201641 0 0 129142 0 0 0 0 ... 69 145796 -82782 239671 0 0 827696 63014 False frank.bay@enron.com

5 rows × 21 columns

Now, let us include this in our main workflow and check the scores.

In [54]:
from helpers_2 import init, split_data, tree_classify, evaluate
from helpers_2 import clean_df
import pandas as pd

# raw data
data_dict = init()

# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index')  # 'index' means keys should be rows

#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
df = clean_df(df)  
# 1.2. Remove Outliers

# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------

# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}

# CLASSIFIER
my_dataset = data_dict  # just for udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0)  # random_state=0 to get consistent results every time it is run

# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Udacity's Evaluation:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Accuracy: 0.69210   Precision: 0.23619  Recall: 0.24150 F1: 0.23881 F2: 0.24042
Total predictions: 10000
True positives:  483    False positives: 1562   False negatives: 1517   True negatives: 6438

No difference. This only means the feature extraction was already treating the 'NaN' strings as 0, or ignoring them. Nevertheless, our df is now cleaner for spotting outliers.
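For reference, Udacity's featureFormat helper (which split_data presumably wraps) coerces 'NaN' entries to 0 along roughly these lines; this is a simplified sketch, not the actual helper code:

def feature_values(data_dictionary, features):
    # simplified: build one row of feature values per person, with 'NaN' -> 0
    rows = []
    for person in data_dictionary:
        row = []
        for feature in features:
            value = data_dictionary[person][feature]
            row.append(0. if value == 'NaN' else float(value))
        rows.append(row)
    return rows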

1.2 Remove Outliers

Outliers could be of many types. For numerical features, some of them could be

  1. Those that max out too far away in either the positive or negative direction (can be spotted in a graph)
  2. Those that are out of place (for example, a value that should always be positive but is negative)

The 1st type is usually easy to detect by visualizing. However, some may require more than one iteration. That is, an outlier may become visibly obvious only after removing another that was much farther out.

The 2nd type is usually harder to detect, as it needs knowledge about the features. For example, total_stock_value is supposed to always be positive because it is the sum of stock values, which themselves cannot be negative, so a negative value in this column could be an outlier. If it is very negative it might show up in the visuals, but otherwise it might be difficult to spot.


Outliers thus, by nature, typically require iterative investigation and feature knowledge in addition to visualization. Also, we shall remove an outlier only if it is not a POI.
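For instance, a quick programmatic check for the second type on total_stock_value might look like the sketch below (using the cleaned frame from above; the actual investigation later in this section uses a column-sum comparison instead):

# rows whose total_stock_value is negative, together with their POI flag
negative_totals = df[df['total_stock_value'] < 0][['total_stock_value', 'poi']]
print negative_totals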

Outlier Iteration 1

Let us start spotting by visualizing. For brevity, the visualizing functions are abstracted away in the helper library. The outliers are annotated as well (which required deciding up front which points to annotate, after a first un-annotated pass).
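The abstracted helpers are roughly along these lines (hypothetical re-implementations for illustration only; the actual helpers_2 code may differ):

def get_feature_max(input_df, feature_list):
    # {feature -> maximum value of that feature} for the columns we want to inspect
    return {feature: input_df[feature].max() for feature in feature_list}

def get_index_label(input_df, column, value):
    # index label (person name) of the first row where `column` equals `value`
    return input_df[input_df[column] == value].index[0]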

We first try to spot the outliers of type 1 - those maxing out in the positive direction.

In [55]:
from helpers_2 import stripplots, get_feature_max

# test df for our analysis
test_df = df  

# outliers in this iteration (this list can be built after a first pass of visualizing without annotation, picking the features from the graphs)
outlier_feature_list = [
    'bonus', 'deferral_payments', 'director_fees','exercised_stock_options',
    'expenses','from_messages', 'from_poi_to_this_person', 'loan_advances',
    'long_term_incentive','other','restricted_stock','restricted_stock_deferred',
    'salary','total_payments','total_stock_value'
    ]
outlier_dict = get_feature_max(test_df, outlier_feature_list)

# plot
stripplots(test_df, outlier_dict)

Let us find out who those outliers are, and remove them first.

In [56]:
from helpers_2 import get_index_label

outlier_list = []
for k,v in outlier_dict.iteritems():
    outlier = get_index_label(test_df, k, v)
    outlier_list.append(outlier)

outlier_list = list(set(outlier_list)) # to remove duplicates
outlier_list
Out[56]:
['LAVORATO JOHN J', 'TOTAL', 'KAMINSKI WINCENTY J', 'BHATNAGAR SANJAY']

TOTAL is obviously an entry error: the TOTAL row of the source spreadsheet was mistakenly taken as a valid data point. The others are not POIs anyway, so we go ahead and remove them all.
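Before dropping, a quick sanity check that none of them is a POI (a small sketch):

print test_df.loc[outlier_list, 'poi']  # expect all False (TOTAL is not a person at all)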

In [57]:
test_df = test_df.drop(outlier_list)

Outlier Iteration 2

Now that we have removed some of the outliers, let us visualize again with the updated data frame. This time, we also inspect an outlier that maxes out in the negative direction.

In [58]:
from helpers_2 import get_feature_min

# outliers in this iteration (this list can be built after a first pass of visualizing without annotation, picking the features from the graphs)
outlier_feature_list = ['deferral_payments', 'from_messages', 'long_term_incentive' ]
outlier_dict = get_feature_max(test_df, outlier_feature_list)
outlier_dict_min = get_feature_min(test_df, ['restricted_stock_deferred'])  # negative max as seen on graph
outlier_dict.update(outlier_dict_min)

# plot
stripplots(test_df, outlier_dict)
In [59]:
outlier_list = []
for k,v in outlier_dict.iteritems():
    outlier = get_index_label(test_df, k, v)
    outlier_list.append(outlier)

#outlier_list = list(set(outlier_list)) # to remove duplicates
outlier_list
Out[59]:
['KEAN STEVEN J', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'MARTIN AMANDA K']
In [60]:
test_df = test_df.drop(outlier_list)

Outlier Iteration 3

We have almost got rid of the type 1 outliers. Type 2 was difficult (I found this one only by referencing how others approached it online). To spot type 2, let us use pandas' describe(), instead of visuals.

In [61]:
description = test_df.describe().loc[['min','max','75%','mean']]
print description 
       bonus  deferral_payments  deferred_income  director_fees  \
min        0            -102500         -3504386              0   
max  7000000            3131860                0         125034   
75%   797188               5971                0              0   
mean  616983             184554          -168401           9415   

      exercised_stock_options  expenses  from_messages  \
min                         0         0              0   
max                  34348384    228763           4343   
75%                   1624050     52905             47   
mean                  1941625     34584            199   

      from_poi_to_this_person  from_this_person_to_poi  loan_advances  \
min                         0                        0              0   
max                       305                      609       81525000   
75%                        39                       12              0   
mean                       33                       18         593659   

      long_term_incentive    other  restricted_stock  \
min                     0        0                 0   
max               3600000 10359729          14761694   
75%                361719   149831            661377   
mean               279831   234830            843709   

      restricted_stock_deferred  salary  shared_receipt_with_poi  to_messages  \
min                     -560222       0                        0            0   
max                       44093 1111258                     5521        15149   
75%                           0  263257                      862         1326   
mean                     -22440  172338                      635         1060   

      total_payments  total_stock_value  
min                0             -44093  
max        103559793           49110078  
75%          1835865            2180983  
mean         1959256            2762231  

Most of the max values found above, even if they look very large compared to the 75th percentile, can still be ignored, because we already established visually that they belong to POIs; no non-POIs stand out much.

Let us check whether total_payments and total_stock_value are properly calculated. Each is the sum of its related individual feature columns, as shown below.

[Figure: Type 2 outlier check - total_payments as the sum of the payment columns, total_stock_value as the sum of the stock columns]

Checking total payments

In [62]:
total_payments_components = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments','loan_advances','other','expenses','director_fees']

# calculate sum of individual components and store in 'total_check'
total_df = test_df[total_payments_components]
total_df = total_df.assign(total_check=test_df[total_payments_components].sum(axis=1)) # thanks: https://stackoverflow.com/questions/33750326/compute-row-average-in-pandas

# compare
import numpy as np
total_df['result'] = np.where( (test_df['total_payments'] == total_df['total_check']) ,'correct','wrong')

total_df['result'].value_counts()
Out[62]:
correct    137
wrong        1
Name: result, dtype: int64
In [63]:
print get_index_label(total_df,'result','wrong')
BELFER ROBERT

Checking total stocks

In [64]:
total_stock_components = ['exercised_stock_options','restricted_stock','restricted_stock_deferred']

# calculate sum of individual components and store in 'total_check'
total_df = test_df[total_stock_components]
total_df = total_df.assign(total_check=test_df[total_stock_components].sum(axis=1)) # thanks: https://stackoverflow.com/questions/33750326/compute-row-average-in-pandas

# compare
import numpy as np
total_df['result'] = np.where( (test_df['total_stock_value'] == total_df['total_check']) ,'correct','wrong')

total_df['result'].value_counts()
Out[64]:
correct    137
wrong        1
Name: result, dtype: int64
In [65]:
print get_index_label(total_df,'result','wrong')
BELFER ROBERT

So we have a type 2 outlier here. Let us check if he is a POI.

In [66]:
def is_POI(input_df, index_label):
    """
    Given a data frame and an index label, return whether that index label is a POI (True) or not (False)
    """
    return input_df.loc[index_label,'poi']

is_POI(test_df, 'BELFER ROBERT')
Out[66]:
False

He is not a POI. Thus we could remove him as well.

In [67]:
outlier_list = ['BELFER ROBERT']
test_df = test_df.drop(outlier_list)

Outlier Summary

Through 3 iterations (2 visual, 1 computational), we have found a few outliers, which we now remove before re-checking our scores.

In [68]:
def remove_outliers(input_df, outlier_list):
    return input_df.drop(outlier_list)
In [69]:
from helpers_2 import init, split_data, tree_classify, evaluate
from helpers_2 import clean_df
import pandas as pd

# raw data
data_dict = init()

# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index')  # 'index' means keys should be rows

#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
df = clean_df(df)  
# 1.2. Remove Outliers
outlier_list = ['LAVORATO JOHN J', 'TOTAL', 'KAMINSKI WINCENTY J', 'BHATNAGAR SANJAY']  # iteration 1
outlier_list += ['KEAN STEVEN J', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'MARTIN AMANDA K'] # iteration 2
outlier_list += ['BELFER ROBERT'] # iteration 3
df = remove_outliers(df, outlier_list)

# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------

# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}

# CLASSIFIER
my_dataset = data_dict  # just for udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0)  # random_state=0 to get consistent results every time it is run

# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Udacity's Evaluation:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Accuracy: 0.71656   Precision: 0.34158  Recall: 0.29700 F1: 0.31773 F2: 0.30496
Total predictions: 9000
True positives:  594    False positives: 1145   False negatives: 1406   True negatives: 5855

Though accuracy improved only slightly, precision and recall improved noticeably compared to the scores before outlier removal.

Note that we still use only one feature, salary, for classification, which could be a reason for the low scores. In the next section we will explore adding more features and scaling them.
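As a preview, scaling numeric columns inside the pandas area could look like this minimal sketch (the column choice here is purely illustrative; the actual feature selection and scaling happen in the next section):

from sklearn.preprocessing import MinMaxScaler

# scale a couple of numeric columns into [0, 1] (columns chosen only for illustration)
cols_to_scale = ['salary', 'bonus']
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale].astype(float))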

So far,

  1. We cleaned the data (replaced 'NaN' with 0)
  2. We removed a few outliers

This nearly concludes our Data Pre Processing.

In [70]:
from graphviz import Digraph
g = Digraph('Work Flow',  node_attr={'shape': 'record'})
g.attr(rankdir='LR')

g.node('step_0', r'Dict To Df', style='filled', fillcolor='#BBDEFB')
g.node('step_1', r'1. Data\nPreProcessing',style='filled', fillcolor='#4be00b')
g.node('step_2', r'2. Data\nStandardization', style='filled', fillcolor='#e6a26a')
g.node('step_2_5', r'Df to Dict', style='filled', fillcolor='#BBDEFB')

g.node('step_3', r'3. Classifier\nTuning', style='filled', fillcolor='#e6a26a')
g.node('step_4', r'4. Evaluation\nMetrics', style='filled', fillcolor='#e6a26a')

g.edges([('step_0','step_1'),('step_1','step_2'),('step_2','step_2_5'),('step_2_5','step_3'),('step_3','step_4')])
g
Out[70]:
[Rendered workflow diagram, same as above, with 1. Data PreProcessing now highlighted as complete]